AOSTE - 2016 - Annual activity report

Section: New Results

Performance analysis and optimisation of an HPC scientific application

Participants: Luis Agustin Nieto, Sid Touati.

In the context of the international internship of Luis Agustin Nieto, we conducted a large-scale experiment in source code optimisation for an HPC application. This work is meant to identify potential approaches that may be automated in the future. The use case was an application named CONVIV, a computer code implementing the VMFCI method to solve the stationary Schrödinger equation for a set of distinguishable degrees of freedom (https://svn.oca.eu/trac/conviv). It is used in chemistry to compute the energy levels of molecules.

This application is very compute-intensive (many hours of computation on a high-performance grid computer). We were given its source code (Fortran with OpenMP) and asked to analyse its performance and to reduce its execution time.

We ran an extensive set of experiments for this application on several computers, mainly on cicada.unice.fr, the shared grid computer used for scientific parallel computing at UNS. We varied many parameters in our experiments:

The number of threads was 2, 4, 6, 8 or 16. We also analysed the sequential version of the code.
The thread affinity strategies for scheduling were: none (Linux scheduler), scatter, compact.
We repeated each experiment 35 times to analyse performance stability.
We used 2 compilers (gfortran, ifort) with -O3.
We did precise performance profiling using the Intel VTune tool.
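The parameter space described above can be enumerated mechanically. The following is a minimal sketch, not the scripts actually used in the study: it assumes the sequential version can be modelled as a single-thread run, that affinity is set through the Intel runtime variable `KMP_AFFINITY` for ifort builds, and that `OMP_PROC_BIND=close` and `OMP_PROC_BIND=spread` are acceptable stand-ins for compact and scatter with gfortran. The function names are hypothetical.

```python
from itertools import product

THREADS = [1, 2, 4, 6, 8, 16]            # assumption: 1 stands for the sequential version
AFFINITIES = ["none", "scatter", "compact"]
COMPILERS = ["gfortran", "ifort"]
REPETITIONS = 35                          # repetitions per configuration, as in the text

# Assumption: approximate mapping of the Intel affinity names onto OMP_PROC_BIND.
PROC_BIND = {"scatter": "spread", "compact": "close"}


def configurations():
    """Yield every (compiler, threads, affinity) combination to measure."""
    for compiler, threads, affinity in product(COMPILERS, THREADS, AFFINITIES):
        if threads == 1 and affinity != "none":
            continue  # thread placement is irrelevant for a single thread
        yield {"compiler": compiler, "threads": threads, "affinity": affinity}


def environment(config):
    """Build the OpenMP environment variables for one configuration."""
    env = {"OMP_NUM_THREADS": str(config["threads"])}
    if config["affinity"] == "none":
        return env  # leave placement to the Linux scheduler
    if config["compiler"] == "ifort":
        env["KMP_AFFINITY"] = config["affinity"]
    else:
        env["OMP_PROC_BIND"] = PROC_BIND[config["affinity"]]
    return env
```

Under these assumptions the grid contains 32 configurations, so 35 repetitions of each amount to 1120 runs per machine; the real experiment may have been organised differently.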

During our experiments we observed that, even with all the parameters above kept fixed, repeating the execution 35 times shows great variability between the best and worst execution times (the worst being more than double the best in some cases). The critical-path functions remained the same for every configuration, and include in particular specific matrix computation functions.
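One simple way to quantify this kind of run-to-run variability is to summarise the repeated timings of a single configuration by their best, worst and median values, together with the worst-to-best ratio. The sketch below is illustrative only; the function name and the choice of statistics are assumptions, not the analysis performed in the study.

```python
import statistics


def variability(times):
    """Summarise repeated execution times (in seconds) of one configuration."""
    if not times:
        raise ValueError("need at least one measurement")
    best, worst = min(times), max(times)
    return {
        "best": best,
        "worst": worst,
        "median": statistics.median(times),
        # A ratio above 2 means the worst run took more than double the best one.
        "worst_over_best": worst / best,
    }
```

For example, with the made-up timings `[10.0, 12.0, 21.0]`, the worst-to-best ratio exceeds 2, which is the situation reported above for some configurations.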

After investigation and experiments, we succeeded in obtaining a spectacular performance improvement by applying the following optimisations:

Replace one of the matrix computation functions by its MKL counterpart (a highly optimised and tuned function provided by Intel).
Use the compact thread scheduling strategy (OpenMP parameter).
With the gfortran compiler and -O3, we reduced the execution time from 18400 seconds to 820 seconds (speedup of 22).
With the ifort compiler and -O3, we reduced the execution time from 21000 seconds to 620 seconds (speedup of 33).
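The speedups quoted above follow directly from the reported execution times, taking speedup as the ratio of the time before optimisation to the time after. The short sketch below recomputes them; the helper name is hypothetical, and the reported integer values appear to be the ratios with the fractional part dropped.

```python
def speedup(before_s, after_s):
    """Ratio of execution time before optimisation to execution time after."""
    return before_s / after_s


# Execution times in seconds, as reported in the text.
measurements = {
    "gfortran": (18400, 820),
    "ifort": (21000, 620),
}

for compiler, (before_s, after_s) in measurements.items():
    print(f"{compiler}: {speedup(before_s, after_s):.1f}x")
# prints:
# gfortran: 22.4x
# ifort: 33.9x
```

Note that the ifort build started from a slower baseline than the gfortran build but ended with the shorter execution time.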
